Abstract

Driven by a large number of potential applications in areas like
bioinformatics, information retrieval and social network analysis, the problem
setting of inferring relations between pairs of data objects has recently
been investigated quite intensively in the machine learning community.
To this end, current approaches typically consider datasets containing
crisp relations, so that standard classification methods can be adopted.
However, relations between objects like similarities and preferences are
often expressed in a graded manner in real-world applications. A general
kernel-based framework for learning relations from data is introduced
here. It extends existing approaches because both crisp and graded
relations are considered, and it unifies existing approaches because different
types of graded relations can be modeled, including symmetric and
reciprocal relations. This framework establishes important links between recent
developments in fuzzy set theory and machine learning. Its usefulness is
demonstrated through various experiments on synthetic and real-world
data.



1 Introduction 

Relational data occurs in many predictive modeling tasks, such as forecasting the
winner in two-player computer games [7], predicting proteins that interact with
other proteins in bioinformatics [67], retrieving documents that are similar to a
target document in text mining [68], investigating the persons that are friends
of each other on social network sites [57], etc. All these examples represent fields
of application in which specific machine learning and data mining algorithms
have been successfully developed to infer relations from data; pairwise relations,
to be more specific.

The typical learning scenario in such situations can be summarized as follows.
Given a dataset of known relations between pairs of objects and a feature
representation of these objects in terms of variables that might characterize the
relations, the goal usually consists of inferring a statistical model that takes two
objects as input and predicts whether the relation of interest occurs for these
two objects. Moreover, since one aims to discover unknown relations, a good
learning algorithm should be able to construct a predictive model that can
generalize to unseen data, i.e., pairs of objects for which at least one of the two
objects was not used to construct the model. As a result of the transition from
predictive models for single objects to pairs of objects, new advanced learning
algorithms need to be developed, resulting in new challenges with regard to
model construction, computational tractability and model assessment.

As relations between objects can be observed in many different forms, this
general problem setting provides links to several subfields of machine learning,
like statistical relational learning [13], graph mining [61], metric learning [66]
and preference learning [25]. More specifically, from a graph-theoretic
perspective, learning a relation can be formulated as learning edges in a graph where
the nodes represent information of the data objects; from a metric learning
perspective, the relation that we aim to learn should satisfy some well-defined
properties like positive definiteness, transitivity or the triangle inequality; and
from a preference learning perspective, the relation expresses a (degree of)
preference in a pairwise comparison of data objects.

The topic of learning relations between objects is also closely related to
recent developments in fuzzy set theory. This article will elaborate on these
connections via two important contributions: (1) the extension of the typical
setting of learning crisp relations to real-valued and ordinal-valued relations
and (2) the inclusion of domain knowledge about relations into the inference
process by explicit modeling of mathematical properties of these relations. For
algorithmic simplicity, many approaches only learn crisp relations, that is,
relations with only 0 and 1 as possible values, so that standard binary
classifiers can be modified. In this context, consider examples such as inferring
protein-protein interaction networks or metabolic networks in bioinformatics.
However, graded relations are observed in many real-world applications [17],
resulting in a need for new algorithms that take graded relational information
into account. Furthermore, the properties of graded relations have been
investigated intensively in the recent fuzzy logic literature^1 and these properties are
very useful to analyze and improve current algorithms. Using mathematical
properties of graded relations, constraints can be imposed for incorporating
domain knowledge in the learning process, to improve predictive performance or
simply to guarantee that a relation with the right properties is learned. This
is definitely the case for properties like transitivity when learning similarity
relations and preference relations - see e.g. [9, 10, 15, 55], but even very basic
properties like symmetry, antisymmetry or reciprocity already provide domain
knowledge that can steer the learning process. For example, in social network
analysis, the notion "person A being a friend of person B" should be considered
as a symmetric relation, while the notion "person A defeats person B in a chess
game" will be antisymmetric (or, equivalently, reciprocal). Nevertheless, many
examples exist, too, where neither symmetry nor antisymmetry necessarily holds,
like the notion "person A trusts person B".

^1 Often the term fuzzy relation is used in the fuzzy set literature to refer to graded relations.
However, fuzzy relations should be seen as a subclass of graded relations. For example,
reciprocal relations should not be considered as fuzzy relations, because they often exhibit a
probabilistic semantics rather than a fuzzy semantics.

In this paper we present a general kernel-based approach that unifies all 
the above cases into one general framework where domain knowledge can be 
easily specified by choosing a proper kernel and model structure, while different 
learning settings are distinguished by means of the loss function. Let $Q(v, v')$
be a binary relation on an object space $\mathcal{V}$, then the following learning settings
will be considered in particular:

• Crisp relations: when the restriction is made that $Q : \mathcal{V}^2 \to \{0, 1\}$, we
arrive at a binary classification task with pairs of objects as input for the
classifier.

• [0, 1]-valued relations: here it is allowed that relations can take the form
$Q : \mathcal{V}^2 \to [0, 1]$, resulting in a regression type of learning setting. The
restriction to the interval [0, 1] is predominantly made because many
mathematical frameworks in fields like fuzzy set theory and decision theory are
built upon such relations, using the notion of a fuzzy relation, but in
general one can account quite easily for real-graded relations by applying a
scaling operation from $\mathbb{R}$ to [0, 1].

• Ordinal-valued relations: situated somewhat in the middle between the 
other two settings, here it is assumed that the actual values of the relation 
do not matter but rather the provided order information should be learned. 

Furthermore, one can integrate different types of domain knowledge in our 
framework, by guaranteeing that certain properties are satisfied. The following 
cases can be distinguished: 

• Symmetric relations. Applications arise in many domains and metric 
learning or learning similarity measures can be seen as special cases that 
require additional properties to hold, such as the triangle inequality for 
metrics and positive definiteness or transitivity properties for similarity 
measures. As shown below, learning symmetric relations can be inter- 
preted as learning edges in an undirected graph. 

• Reciprocal or antisymmetric relations. Applications arise here in domains
such as preference learning, game theory and bioinformatics for representing
preference relations, choice probabilities, winning probabilities, gene
regulation, etc. We will provide a formal definition below, but, given a
rescaling operation from $\mathbb{R}$ to [0, 1], antisymmetric relations can be
converted into reciprocal relations. Similar to symmetric relations, transitivity
properties typically guarantee additional constraints that are definitely
required for certain applications. It is, for example, well known in decision
theory and preference modeling that transitive preference relations result
in utility functions [6, 36]. Learning reciprocal or antisymmetric relations
can be interpreted as learning edges in a directed graph.

• Ordinary binary relations. Many applications can be found where neither
symmetry nor reciprocity holds. From a graph inference perspective,
learning such relations should be seen as learning the edges in a bidirectional
graph, where edges in one direction do not impose constraints on
edges in the other direction.

Indeed, the framework that we propose below strongly relies on graphs, where 
nodes represent the data objects that are studied and the edges represent the 
relations present in the training set. The weights on the edges characterize the 
values of known relations, while unconnected nodes indicate pairs of objects 
for which the unknown relation needs to be predicted. The left graph in
Figure 1 visualizes a toy example representing the most general case where neither
symmetry nor reciprocity holds. Depending on the application, the learning
algorithm should try to predict the relations for three types of object pairs: 

• pairs of objects that are already present in the training dataset by means 
of other edges, like the pair (A,B), 

• pairs of objects for which one of the two objects occurs in the training 
dataset, like the pair (E,F), 

• pairs of objects for which none of the two objects is observed during train- 
ing, like the pair (F,G). 

The graphs on the right-hand side in Figure 1 show examples of specific types
of relations that are covered by our framework. The differences between these
relations will become clearer in the following sections.

2 General framework 

2.1 Notation and basic concepts 

Let us start by introducing some notation. We assume that the data is
structured as a graph $G = (\mathcal{V}, \mathcal{E}, Q)$, where $\mathcal{V}$ corresponds to the set of nodes $v$
and $\mathcal{E} \subseteq \mathcal{V}^2$ represents the set of edges $e$, for which training labels are provided in
terms of relations. Moreover, these relations are represented by training weights
$y_e$ on the edges, generated from an unknown underlying relation $Q : \mathcal{V}^2 \to [0, 1]$.
Relations are required to take values in the interval [0, 1] because some properties
that we need are historically defined for such relations, but an extension to
real-graded relations $h : \mathcal{V}^2 \to \mathbb{R}$ can always be realized. Consider $b \in \mathbb{R}^{+}$ and an
increasing isomorphism $\sigma : [-b, b] \to [0, 1]$ that satisfies $\sigma(x) = 1 - \sigma(-x)$, then
we consider the $\mathbb{R} \to [0, 1]$ mapping $\nabla$ defined by:

$$\nabla(x) = \begin{cases} 0, & \text{if } x < -b \\ \sigma(x), & \text{if } -b \le x \le b \\ 1, & \text{if } b < x \end{cases}$$

and its inverse $\nabla^{-1} = \sigma^{-1}$.

Figure 1: Left: example of a multi-graph representing the most general case,
where no additional properties of relations are assumed. Right: examples of
eight different types of relations in a graph of cardinality three. The following
relational properties are illustrated: (C) crisp, (G) graded, (R) reciprocal, (S)
symmetric, (T) transitive and (I) intransitive. For the reciprocal relations, (I)
refers to a relation that does not satisfy weak stochastic transitivity, while (T)
shows an example of a relation fulfilling strong stochastic transitivity. For
the symmetric relations, (I) refers to a relation that does not satisfy T-transitivity
w.r.t. the Lukasiewicz t-norm $T_L(a, b) = \max(a + b - 1, 0)$, while (T) shows
an example of a relation that fulfills T-transitivity w.r.t. the product t-norm
$T_P(a, b) = ab$. See Section 4 for formal definitions of transitivity.

Any real-valued relation $h : \mathcal{V}^2 \to \mathbb{R}$ can be transformed into a [0, 1]-valued
relation $Q$ as follows:

$$Q(v, v') = \nabla(h(v, v')), \quad \forall (v, v') \in \mathcal{V}^2, \tag{1}$$

and conversely by means of $\nabla^{-1}$. In what follows we tacitly assume that $\nabla$ has
been fixed.
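
To make the rescaling concrete, the following minimal sketch implements one admissible choice of the mapping. The linear isomorphism $\sigma(x) = (x + b)/(2b)$ and the function names `nabla` and `nabla_inv` are assumptions made for illustration only; the text above does not prescribe a particular $\sigma$.

```python
import numpy as np

def nabla(x, b=1.0):
    """Map real values to [0, 1]: 0 below -b, sigma(x) on [-b, b], 1 above b."""
    x = np.asarray(x, dtype=float)
    # Assumed sigma: the linear isomorphism sigma(x) = (x + b) / (2b),
    # which satisfies sigma(x) = 1 - sigma(-x).
    return np.clip((x + b) / (2.0 * b), 0.0, 1.0)

def nabla_inv(q, b=1.0):
    """Inverse of the assumed sigma, mapping [0, 1] back to [-b, b]."""
    q = np.asarray(q, dtype=float)
    return 2.0 * b * q - b

# Hypothetical real-valued relation values h(v, v').
x = np.array([-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0])
q = nabla(x)
```

Because the assumed $\sigma$ satisfies $\sigma(x) = 1 - \sigma(-x)$, an antisymmetric $h$ is mapped to a reciprocal $Q$, which is the property used in Section 3.1.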

Following the standard notations for kernel methods, we formulate our
learning problem as the selection of a suitable function $h \in \mathcal{H}$, with $\mathcal{H}$ a certain
hypothesis space, in particular a reproducing kernel Hilbert space (RKHS). More
specifically, the RKHS supports in our case hypotheses $h : \mathcal{V}^2 \to \mathbb{R}$ denoted as

$$h(e) = w^T \Phi(e),$$

with $w$ a vector of parameters that needs to be estimated from training data, $\Phi$
a joint feature mapping for edges in the graph (see below) and $a^T$ the transpose
of a vector $a$. Let us denote a training dataset of cardinality $q = |\mathcal{E}|$ as a set
$T = \{(e, y_e) \mid e \in \mathcal{E}\}$ of input-label pairs, then we formally consider the following
optimization problem, in which we select an appropriate hypothesis $h$ from $\mathcal{H}$
for training data $T$:

$$\hat{h} = \arg\min_{h \in \mathcal{H}} \frac{1}{q} \sum_{e \in \mathcal{E}} \mathcal{L}(h(e), y_e) + \lambda \|h\|^2_{\mathcal{H}} \tag{2}$$

with $\mathcal{L}$ a given loss function, $\|\cdot\|^2_{\mathcal{H}}$ the traditional quadratic regularizer on the
RKHS and $\lambda > 0$ a regularization parameter. According to the representer
theorem [47], any minimizer $h \in \mathcal{H}$ of (2) admits a dual representation of the
following form:

$$h(\bar{e}) = w^T \Phi(\bar{e}) = \sum_{e \in \mathcal{E}} a_e K^{\Phi}(e, \bar{e}), \tag{3}$$

with $a_e \in \mathbb{R}$ dual parameters, $K^{\Phi}$ the kernel function associated with the RKHS
and $\Phi$ the feature mapping corresponding to $K^{\Phi}$ and

$$w = \sum_{e \in \mathcal{E}} a_e \Phi(e).$$

We will alternate several times between the primal and dual representation for
$h$ in the remainder of this article.

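The primal representation as defined in (2) and its dual equivalent (3) yield
an RKHS defined on edges in the graph. In addition, we will establish an RKHS
defined on nodes, as every edge consists of a couple of nodes. Given an input
space $\mathcal{V}$ and a kernel $K : \mathcal{V} \times \mathcal{V} \to \mathbb{R}$, the RKHS associated with $K$ can be
considered as the completion of

$$\left\{ f \in \mathbb{R}^{\mathcal{V}} \;\middle|\; f(v) = \sum_{i=1}^{m} \beta_i K(v, v_i) \right\}$$

in the norm

$$\|f\|_K = \sqrt{\sum_{i,j} \beta_i \beta_j K(v_i, v_j)},$$

where $\beta_i \in \mathbb{R}$, $m \in \mathbb{N}$, $v_i \in \mathcal{V}$.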

2.2 Learning arbitrary relations 

As mentioned in the introduction, both crisp and graded relations can be
handled by our framework. To make a subdivision between different cases, a loss
function needs to be specified. For crisp relations, one can typically use the
hinge loss, which is given by:

$$\mathcal{L}(h(e), y) = [1 - y h(e)]_+ ,$$

with $[\cdot]_+$ the positive part of the argument. Alternatively, one can opt to
optimize a probabilistic loss function like the logistic loss:

$$\mathcal{L}(h(e), y) = \ln(1 + \exp(-y h(e))).$$

Conversely, if in a given application the observed relations are graded instead of
crisp, other loss functions have to be considered. Hence, we will run experiments
with a least-squares loss function:

$$\mathcal{L}(h(e), y) = (y_e - h(e))^2, \tag{4}$$

resulting in a regression type of learning setting. Alternatively, one could prefer
to optimize a more robust regression loss like the $\epsilon$-insensitive loss, in case
outliers are expected in the training dataset.
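
For the least-squares loss (4), problem (2) can be solved in closed form in the dual. The sketch below is an illustration under stated assumptions, not the implementation used in the experiments: the function names, the random toy data and the linear edge kernel are hypothetical. It solves $(K + \lambda q I) a = y$ and compares the resulting $w = \sum_e a_e \Phi(e)$ with the primal ridge solution.

```python
import numpy as np

def fit_dual_least_squares(K, y, lam):
    """Solve (K + lam*q*I) a = y, which yields a minimizer of
    (1/q) * ||y - K a||^2 + lam * a^T K a."""
    q = K.shape[0]
    return np.linalg.solve(K + lam * q * np.eye(q), y)

def predict(K_test_train, a):
    """h(e_bar) = sum_e a_e K(e, e_bar); rows index test edges."""
    return K_test_train @ a

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))   # hypothetical explicit edge features Phi(e)
y = rng.uniform(size=20)       # hypothetical graded labels in [0, 1]
K = X @ X.T                    # linear edge kernel <Phi(e), Phi(e_bar)>
lam = 0.1

a = fit_dual_least_squares(K, y, lam)
w_dual = X.T @ a               # w = sum_e a_e Phi(e)

# Primal ridge solution of the same objective, for comparison.
w_primal = np.linalg.solve(X.T @ X + lam * 20 * np.eye(4), X.T @ y)
train_predictions = predict(K, a)
```

Both routes give the same hypothesis; the dual route only needs kernel evaluations, which is what makes the pairwise kernels of the following sections usable without explicit feature vectors.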

So far, our framework does not differ from standard classification and
regression algorithms. However, the specification of a more precise model structure
for (2) offers a couple of new challenges. In the most general case, when no
further restrictions on the underlying relation can be specified, the following
Kronecker product feature mapping is proposed to express pairwise interactions
between features of nodes:

$$\Phi(e) = \Phi(v, v') = \phi(v) \otimes \phi(v') ,$$

where $\phi$ represents the feature mapping for individual nodes. A formal definition
of the Kronecker product can be found in the appendix. As first shown in [3], the
Kronecker product pairwise feature mapping yields the Kronecker product edge
kernel (a.k.a. the tensor product pairwise kernel) in the dual representation:

$$K^{\Phi}_{\otimes}(e, \bar{e}) = K^{\Phi}_{\otimes}(v, v', \bar{v}, \bar{v}') = K^{\phi}(v, \bar{v}) K^{\phi}(v', \bar{v}') , \tag{5}$$

with $K^{\phi}$ the kernel corresponding to $\phi$.
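
Identity (5) can be checked numerically. The following sketch assumes a linear node kernel and random node features, both hypothetical, and compares the inner product of explicit Kronecker product features with the product of node kernel values.

```python
import numpy as np

def kronecker_edge_kernel(k_v_vbar, k_vp_vbarp):
    """Edge kernel (5): K_phi(v, v_bar) * K_phi(v', v_bar')."""
    return k_v_vbar * k_vp_vbarp

rng = np.random.default_rng(1)
phi = rng.normal(size=(4, 3))   # hypothetical node features, one row per node
K_node = phi @ phi.T            # linear node kernel K_phi

e, e_bar = (0, 1), (2, 3)       # edges as (v, v') index pairs

# Explicit route: inner product of Kronecker product feature vectors.
explicit = np.kron(phi[e[0]], phi[e[1]]) @ np.kron(phi[e_bar[0]], phi[e_bar[1]])

# Implicit route: product of two node kernel evaluations.
implicit = kronecker_edge_kernel(K_node[e[0], e_bar[0]], K_node[e[1], e_bar[1]])
```

The explicit feature vector has dimension $r^2$ for $r$ node features, whereas the implicit route needs only two node kernel evaluations per pair of edges.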

This section aims to formally justify the Kronecker product edge kernel as a
sound default choice when no further domain knowledge is provided
about the underlying relation that generates the data. We claim that with
an appropriate choice for $K^{\phi}$, such as the Gaussian RBF kernel, the kernel
$K^{\Phi}_{\otimes}$ generates a class $\mathcal{H}$ of universally approximating functions for learning any
type of relation. Armed with the definition of universality for kernels and the
Stone-Weierstraß theorem [53], we arrive at the following theorem concerning
the Kronecker product pairwise kernels:

Theorem 2.1. Let us assume that the space of nodes $\mathcal{V}$ is a compact metric
space. If a continuous kernel $K^{\phi}$ is universal on $\mathcal{V}$, then $K^{\Phi}_{\otimes}$ defines a universal
kernel on $\mathcal{E}$.

The proof can be found in the appendix. We would like to emphasize that 
one cannot conclude from the theorem that the Kronecker product pairwise ker- 
nel is the best kernel to use in all possible situations. The theorem only shows 
that the Kronecker product pairwise kernel makes a reasonably good choice, if 
no further domain knowledge about the underlying relation is known. Namely, 
the theorem says that given a suitable sample of data, the RKHS of the kernel 
contains functions that are arbitrarily close to any continuous relation in the 
uniform norm. However, the theorem does not say anything about how likely 
it is to have, as a training set, such a data sample that can represent the ap- 
proximating function. Further, the theorem only concerns graded relations that 
are continuous and therefore crisp relations and graded, discontinuous relations 
require more detailed considerations. 

Other kernel functions might of course outperform the Kronecker product 
pairwise kernel in applications where domain knowledge can be incorporated in 
the kernel function. In the following section we discuss reciprocity, symmetry 
and transitivity as three relational properties that can be represented by means 
of more specific kernel functions. As a side note, we also introduce the Cartesian
pairwise kernel, which is formally defined as follows:

$$K^{\Phi}_{C}(v, v', \bar{v}, \bar{v}') = K^{\phi}(v', \bar{v}')[v = \bar{v}] + K^{\phi}(v, \bar{v})[v' = \bar{v}'] ,$$

with $[\cdot]$ the indicator function, returning one when both elements are identical
and zero otherwise. This kernel was recently proposed by [31] as an alternative
to the Kronecker product pairwise kernel. By construction, the Cartesian
pairwise kernel has important limitations, since it cannot generalize to couples of
nodes for which both nodes did not appear in the training dataset.
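
The limitation can be seen directly in a small numerical sketch. The node features, the Gaussian RBF node kernel and the edge lists below are hypothetical: a test edge whose two nodes both lie outside the training edges has a zero kernel value with every training edge, so any model of the form (3) predicts zero for it.

```python
import numpy as np

def cartesian_edge_kernel(K_node, e, e_bar):
    """K_C(e, e_bar) = K(v', v_bar')[v = v_bar] + K(v, v_bar)[v' = v_bar']."""
    (v, vp), (vb, vbp) = e, e_bar
    return K_node[vp, vbp] * float(v == vb) + K_node[v, vb] * float(vp == vbp)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                       # hypothetical node features
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_node = np.exp(-0.5 * sq_dist)                   # Gaussian RBF node kernel

train_edges = [(0, 1), (1, 2), (2, 0)]            # only nodes 0, 1, 2 are seen

# Two edges sharing their first node: the kernel compares the second nodes.
shared = cartesian_edge_kernel(K_node, (0, 1), (0, 2))

# Edge (3, 4) shares no node with any training edge: all kernel values vanish.
unseen = [cartesian_edge_kernel(K_node, e, (3, 4)) for e in train_edges]
```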



3 Special relations 

Thus, if no further information is available about the relation that underlies the
data, the Kronecker product edge kernel is the natural choice. In this most
general case, we allow that for any pair of nodes in the graph several edges can
exist, in which an edge in one direction does not necessarily impose constraints
on the edge in the opposite direction. Multiple edges in the same direction can
connect two nodes, leading to a multi-graph as in Figure 1, where two different
edges in the same direction connect nodes D and E. This construction is
required to allow repeated measurements. However, two important subclasses of
relations deserve further attention: reciprocal relations and symmetric relations.



3.1 Reciprocal relations 

This subsection briefly summarizes our previous work on learning reciprocal 
relations [43]. Let us start with a definition of this type of relation. 

Definition 3.1. A binary relation $Q : \mathcal{V}^2 \to [0, 1]$ is called a reciprocal relation
if for all $(v, v') \in \mathcal{V}^2$ it holds that $Q(v, v') = 1 - Q(v', v)$.

Definition 3.2. A binary relation $h : \mathcal{V}^2 \to \mathbb{R}$ is called an antisymmetric
relation if for all $(v, v') \in \mathcal{V}^2$ it holds that $h(v, v') = -h(v', v)$.

For reciprocal and antisymmetric relations, every edge $e = (v, v')$ in a
multi-graph like Figure 1 induces an unobserved invisible edge $e_R = (v', v)$ with
appropriate weight in the opposite direction. The transformation operator $\nabla$
transforms an antisymmetric relation into a reciprocal relation. Applications
of reciprocal relations arise here in domains such as preference learning, game
theory and bioinformatics for representing preference relations, choice
probabilities, winning probabilities, gene regulation, etc. The weight on the edge defines
the real direction of such an edge. If the weight on the edge $e = (v, v')$ is higher
than 0.5, then the direction is from $v$ to $v'$, but when the weight is lower than
0.5, then the direction should be interpreted as inverted; for example, the edges
from A to C in Figures 1 (a) and (e) should be interpreted as edges starting
from A instead of C. If the relation is 3-valued as $Q : \mathcal{V}^2 \to \{0, 1/2, 1\}$, then
we end up with a three-class ordinal regression setting instead of an ordinary
regression setting.

Interestingly, reciprocity can be easily incorporated in our framework.

Proposition 3.3. Let $\Psi$ be a feature mapping on $\mathcal{V}^2$ and let $h$ be a hypothesis
defined by (2), then the relation $Q$ of type (1) is reciprocal if $\Phi$ is given by

$$\Phi_R(e) = \Phi_R(v, v') = \Psi(v, v') - \Psi(v', v) .$$

The proof is immediate. In addition, one can easily show that reciprocity as
domain knowledge can be enforced in the dual formulation. Let us in the least
restrictive form now consider the Kronecker product for $\Psi$, then one obtains for
$\Phi_R$ the kernel $K^{\Phi}_{\otimes R}$ given by

$$K^{\Phi}_{\otimes R}(e, \bar{e}) = 2\left(K^{\phi}(v, \bar{v}) K^{\phi}(v', \bar{v}') - K^{\phi}(v, \bar{v}') K^{\phi}(v', \bar{v})\right) . \tag{6}$$

The following theorem shows that this kernel can represent any type of reciprocal
relation.

Theorem 3.4. Let

$$R(\mathcal{V}^2) = \{ t \mid t \in C(\mathcal{V}^2),\ t(v, v') = -t(v', v) \}$$

be the space of all continuous antisymmetric relations from $\mathcal{V}^2$ to $\mathbb{R}$. If $K^{\phi}$ on
$\mathcal{V}$ is universal, then for every function $t \in R(\mathcal{V}^2)$ and every $\epsilon > 0$, there exists
a function $h$ in the RKHS induced by the kernel $K^{\Phi}_{\otimes R}$ defined in (6), such that

$$\max_{(v, v') \in \mathcal{V}^2} \{ |t(v, v') - h(v, v')| \} \le \epsilon . \tag{7}$$

The proof can be found in the appendix.
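
As a quick numerical illustration of how kernel (6) enforces the constraint, the sketch below builds a hypothesis in the dual form (3) with arbitrary dual parameters and checks that it is antisymmetric. The node features, the Gaussian RBF node kernel, the training edges and the dual parameters are all hypothetical.

```python
import numpy as np

def reciprocal_kronecker_kernel(K_node, e, e_bar):
    """Kernel (6): 2 * (K(v, vb) K(v', vb') - K(v, vb') K(v', vb))."""
    (v, vp), (vb, vbp) = e, e_bar
    return 2.0 * (K_node[v, vb] * K_node[vp, vbp]
                  - K_node[v, vbp] * K_node[vp, vb])

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))                        # hypothetical node features
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_node = np.exp(-0.5 * sq_dist)                    # Gaussian RBF node kernel

train_edges = [(0, 1), (1, 2), (3, 0), (2, 4)]
a = rng.normal(size=len(train_edges))              # arbitrary dual parameters

def h(e_bar):
    """Dual-form hypothesis h(e_bar) = sum_e a_e K(e, e_bar)."""
    return sum(a_e * reciprocal_kronecker_kernel(K_node, e, e_bar)
               for a_e, e in zip(a, train_edges))

# Largest violation of h(v, v') = -h(v', v) over all ordered node pairs.
max_violation = max(abs(h((i, j)) + h((j, i)))
                    for i in range(5) for j in range(5))
```

Antisymmetry holds for any choice of dual parameters, since swapping the two nodes of the second edge flips the sign of every kernel evaluation; applying $\nabla$ then gives a reciprocal $Q$.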
3.2 Symmetric relations 

Symmetric relations form another important subclass of relations in our frame- 
work. As a specific type of symmetric relations, similarity relations constitute 
the underlying relation in many application domains where relations between 
objects need to be learned. Symmetric relations are formally defined as follows. 

Definition 3.5. A binary relation $Q : \mathcal{V}^2 \to [0, 1]$ is called a symmetric relation
if for all $(v, v') \in \mathcal{V}^2$ it holds that $Q(v, v') = Q(v', v)$.

Definition 3.6. A binary relation $h : \mathcal{V}^2 \to \mathbb{R}$ is called a symmetric relation if
for all $(v, v') \in \mathcal{V}^2$ it holds that $h(v, v') = h(v', v)$.

Note that $\nabla$ preserves symmetry. For symmetric relations, edges in
multi-graphs like Figure 1 become undirected. Applications arise in many domains
and metric learning or learning similarity measures can be seen as special cases.
If the relation is 2-valued as $Q : \mathcal{V}^2 \to \{0, 1\}$, then we end up with a classification
setting instead of a regression setting.

Just like reciprocity, it turns out that symmetry can be easily
incorporated in our framework.

Proposition 3.7. Let $\Psi$ be a feature mapping on $\mathcal{V}^2$ and let $h$ be a hypothesis
defined by (2), then the relation $Q$ of type (1) is symmetric if $\Phi$ is given by

$$\Phi_S(e) = \Phi_S(v, v') = \Psi(v, v') + \Psi(v', v) .$$

In addition, by using mathematical properties of the Kronecker product, one
obtains in the dual formulation an edge kernel that looks very similar to the one
derived for reciprocal relations. Let us again consider the Kronecker product
for $\Psi$, then one obtains for $\Phi_S$ the kernel $K^{\Phi}_{\otimes S}$ given by

$$K^{\Phi}_{\otimes S}(e, \bar{e}) = 2\left(K^{\phi}(v, \bar{v}) K^{\phi}(v', \bar{v}') + K^{\phi}(v, \bar{v}') K^{\phi}(v', \bar{v})\right) .$$

Thus, the subtraction of kernels in the reciprocal case becomes an addition of
kernels in the symmetric case. The above kernel has been used for predicting
protein-protein interactions in bioinformatics [3] and it has been theoretically
analyzed in [24]. More specifically, it has been shown in the latter paper that,
for some methods, enforcing symmetry in the kernel function yields identical
results as adding every edge twice to the dataset, by taking each of the two nodes
once as first element of the edge. Unlike many existing kernel-based methods for
pairwise data, the models obtained with these kernels are able to represent
any reciprocal or symmetric relation respectively, without imposing additional
transitivity properties of the relations.
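
The symmetric counterpart can be checked in the same way. In the sketch below the node features, the Gaussian RBF node kernel, the training edges and the dual parameters are hypothetical; the check confirms that a dual-form hypothesis built on the symmetric Kronecker product kernel is invariant to the order of the two nodes.

```python
import numpy as np

def symmetric_kronecker_kernel(K_node, e, e_bar):
    """Symmetric edge kernel: 2 * (K(v, vb) K(v', vb') + K(v, vb') K(v', vb))."""
    (v, vp), (vb, vbp) = e, e_bar
    return 2.0 * (K_node[v, vb] * K_node[vp, vbp]
                  + K_node[v, vbp] * K_node[vp, vb])

rng = np.random.default_rng(5)
X = rng.normal(size=(5, 3))                        # hypothetical node features
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_node = np.exp(-0.5 * sq_dist)                    # Gaussian RBF node kernel

train_edges = [(0, 1), (1, 2), (3, 0), (2, 4)]
a = rng.normal(size=len(train_edges))              # arbitrary dual parameters

def h(e_bar):
    """Dual-form hypothesis h(e_bar) = sum_e a_e K(e, e_bar)."""
    return sum(a_e * symmetric_kronecker_kernel(K_node, e, e_bar)
               for a_e, e in zip(a, train_edges))

# Largest violation of h(v, v') = h(v', v) over all ordered node pairs.
max_violation = max(abs(h((i, j)) - h((j, i)))
                    for i in range(5) for j in range(5))
```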

We also remark that for symmetry as well, one can prove that the Kronecker
product edge kernel yields a model that is flexible enough to represent any type
of underlying relation.

Theorem 3.8. Let

$$S(\mathcal{V}^2) = \{ t \mid t \in C(\mathcal{V}^2),\ t(v, v') = t(v', v) \}$$

be the space of all continuous symmetric relations from $\mathcal{V}^2$ to $\mathbb{R}$. If $K^{\phi}$ on $\mathcal{V}$
is universal, then for every function $t \in S(\mathcal{V}^2)$ and every $\epsilon > 0$, there exists a
function $h$ in the RKHS induced by the kernel $K^{\Phi}_{\otimes S}$ such that

$$\max_{(v, v') \in \mathcal{V}^2} \{ |t(v, v') - h(v, v')| \} \le \epsilon .$$

The proof is analogous to that of Theorem 3.4 (see appendix).

As a side note, we remark that a symmetric and reciprocal version of the
Cartesian kernel can be introduced as well.



4 Relationships with fuzzy set theory 

The previous section revealed that specific Kronecker product edge kernels can
be constructed for modeling reciprocal and symmetric relations, without
requiring any further background about these relations. In this section we demonstrate
that the Kronecker product edge kernels $K^{\Phi}_{\otimes}$, $K^{\Phi}_{\otimes R}$ and $K^{\Phi}_{\otimes S}$ are particularly
useful for modeling intransitive relations. Intransitive relations occur in a lot
of real-world scenarios, like game playing [14, 19], competition between
bacteria [8, 30, 32, 33, 40, 44] and fungi [5^, mating choice of lizards [49] and food choice
of birds [63], to name just a few. In an informal way, Figure 1 shows with the
help of examples what transitivity means for symmetric and reciprocal relations
that are crisp and graded.

Despite the occurrence of intransitive relations in many domains, one has
to admit that most applications are still characterized by relations that fulfill
relatively strong transitivity requirements. For example, in decision making,
preference modeling and social choice theory, one can argue that reciprocal
relations like choice probabilities and preference judgments should satisfy certain
transitivity properties, if they represent rational human decisions made after
well-reasoned comparisons on objects [18, 59]. For symmetric relations as
well, transitivity plays an important role [29], when modeling similarity
relations, metrics, kernels, etc.

It is for this reason that transitivity properties have been studied extensively
in fuzzy set theory and related fields. For reciprocal relations, one traditionally
uses the notion of stochastic transitivity [36].

Definition 4.1. Let $g$ be an increasing $[1/2, 1]^2 \to [0, 1]$ mapping. A reciprocal
relation $Q : \mathcal{V}^2 \to [0, 1]$ is called g-stochastic transitive if for any $(v_1, v_2, v_3) \in \mathcal{V}^3$

$$\left(Q(v_1, v_2) \ge 1/2 \wedge Q(v_2, v_3) \ge 1/2\right) \Rightarrow Q(v_1, v_3) \ge g(Q(v_1, v_2), Q(v_2, v_3)) .$$

Important special cases are weak stochastic transitivity when $g(a, b) = 1/2$,
moderate stochastic transitivity when $g(a, b) = \min(a, b)$ and strong stochastic
transitivity when $g(a, b) = \max(a, b)$. Alternative (and more general)
frameworks are FG-transitivity [56] and cycle transitivity [9, 10]. For graded
symmetric relations, the notion of T-transitivity has been put forward [12, 39].

Definition 4.2. A symmetric relation $Q : \mathcal{V}^2 \to [0, 1]$ is called T-transitive with
$T$ a t-norm if for any $(v_1, v_2, v_3) \in \mathcal{V}^3$

$$T(Q(v_1, v_2), Q(v_2, v_3)) \le Q(v_1, v_3) . \tag{8}$$

Three important t-norms are the minimum t-norm $T_M(a, b) = \min(a, b)$, the
product t-norm $T_P(a, b) = ab$ and the Lukasiewicz t-norm $T_L(a, b) = \max(a + b - 1, 0)$.
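
Both definitions can be checked by brute force on small relations stored as matrices. The sketch below is purely illustrative: the function names and the three example matrices are hypothetical and are not taken from Figure 1.

```python
import itertools
import numpy as np

def is_stochastic_transitive(Q, g, tol=1e-12):
    """Check g-stochastic transitivity of a reciprocal relation matrix Q."""
    n = Q.shape[0]
    for i, j, k in itertools.product(range(n), repeat=3):
        if Q[i, j] >= 0.5 and Q[j, k] >= 0.5:
            if Q[i, k] < g(Q[i, j], Q[j, k]) - tol:
                return False
    return True

def is_T_transitive(Q, T, tol=1e-12):
    """Check T-transitivity (8) of a symmetric relation matrix Q."""
    n = Q.shape[0]
    return all(T(Q[i, j], Q[j, k]) <= Q[i, k] + tol
               for i, j, k in itertools.product(range(n), repeat=3))

weak = lambda a, b: 0.5
strong = max
t_prod = lambda a, b: a * b
t_luk = lambda a, b: max(a + b - 1.0, 0.0)

# Hypothetical cyclic reciprocal relation (rock-paper-scissors pattern).
Q_cycle = np.array([[0.5, 0.8, 0.2],
                    [0.2, 0.5, 0.8],
                    [0.8, 0.2, 0.5]])
# Hypothetical reciprocal relation with a clear ordering of the three objects.
Q_chain = np.array([[0.5, 0.7, 0.9],
                    [0.3, 0.5, 0.7],
                    [0.1, 0.3, 0.5]])
# Hypothetical symmetric relation.
S = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
```

In this toy setting, `Q_cycle` violates even weak stochastic transitivity, `Q_chain` satisfies the strong variant, and `S` is transitive w.r.t. the Lukasiewicz t-norm but not w.r.t. the product t-norm, since $0.8 \cdot 0.4 > 0.3$.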

In addition, several authors have shown that various forms of transitivity give
rise to utility representable or numerically representable relations, also called
fuzzy weak orders - see e.g. [4, 6, 20, 34, 36]. We will use the term ranking
representability to establish a link with machine learning. We give a slightly
specific definition that unifies reciprocal and symmetric relations.

Definition 4.3. A reciprocal or symmetric relation $Q : \mathcal{V}^2 \to [0, 1]$ is called
ranking representable if there exists a ranking function $f : \mathcal{V} \to \mathbb{R}$ such that for
all $(v, v') \in \mathcal{V}^2$ it respectively holds that

1. $Q(v, v') = \nabla(f(v) - f(v'))$ (reciprocal case);

2. $Q(v, v') = \nabla(f(v) + f(v'))$ (symmetric case).

The main idea is that ranking representable relations can be constructed
from a utility function $f$. Ranking representable reciprocal relations correspond
to directed acyclic graphs, and a unique ranking of the nodes in such graphs
can be obtained with topological sorting algorithms. The ranking representable
reciprocal relations of Figures 1 (a) and (e) for example yield the global ranking
$A \succ B \succ C$. Interestingly, ranking representability of reciprocal relations and
symmetric relations can be easily achieved in our framework by simplifying the
joint feature mapping $\Psi$. Let $\Psi(v, v') = \phi(v)$ such that $K^{\Phi}$ simplifies to

$$K^{\Phi}_{fR}(e, \bar{e}) = K^{\phi}(v, \bar{v}) + K^{\phi}(v', \bar{v}') - K^{\phi}(v, \bar{v}') - K^{\phi}(v', \bar{v}),$$
$$K^{\Phi}_{fS}(e, \bar{e}) = K^{\phi}(v, \bar{v}) + K^{\phi}(v', \bar{v}') + K^{\phi}(v, \bar{v}') + K^{\phi}(v', \bar{v}),$$

when $\Phi(v, v') = \Phi_R(v, v')$ or $\Phi(v, v') = \Phi_S(v, v')$, respectively, then the
following proposition holds.

Proposition 4.4. The relation $Q : \mathcal{V}^2 \to [0, 1]$ given by (1) and $h$ defined by
(2) with $K^{\Phi} = K^{\Phi}_{fR}$ (respectively $K^{\Phi} = K^{\Phi}_{fS}$) is a ranking representable reciprocal
(respectively symmetric) relation.

The proof directly follows from the fact that for this specific kernel, $h(v, v')$
can be respectively written as $f(v) - f(v')$ and $f(v) + f(v')$. The kernel $K^{\Phi}_{fR}$
was initially introduced in [23] for ordinal regression and during the last
decade it has been extensively used as a main building block in many
kernel-based ranking algorithms. Since ranking representability of reciprocal relations
implies strong stochastic transitivity of reciprocal relations, $K^{\Phi}_{fR}$ can represent
this type of domain knowledge.
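
The decomposition used in the proof can be made explicit in the dual. In the sketch below, which uses hypothetical node features, training edges and dual parameters, the function $f(u) = \sum_e a_e (K^{\phi}(v_e, u) - K^{\phi}(v'_e, u))$ plays the role of the ranking function, and the check confirms $h(v, v') = f(v) - f(v')$ for the reciprocal kernel.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 3))                 # hypothetical node features
K = X @ X.T                                 # linear node kernel K_phi

train_edges = [(0, 1), (1, 2), (3, 0), (2, 4)]
a = rng.normal(size=len(train_edges))       # arbitrary dual parameters

def k_fR(e, e_bar):
    """Reciprocal ranking kernel: K(v,vb) + K(v',vb') - K(v,vb') - K(v',vb)."""
    (v, vp), (vb, vbp) = e, e_bar
    return K[v, vb] + K[vp, vbp] - K[v, vbp] - K[vp, vb]

def h(e_bar):
    """Dual-form hypothesis on edges."""
    return sum(a_e * k_fR(e, e_bar) for a_e, e in zip(a, train_edges))

def f(u):
    """Implied ranking (utility) function on single nodes."""
    return sum(a_e * (K[v, u] - K[vp, u]) for a_e, (v, vp) in zip(a, train_edges))

# Largest deviation between h(v, v') and f(v) - f(v') over all node pairs.
max_gap = max(abs(h((i, j)) - (f(i) - f(j)))
              for i in range(5) for j in range(5))
ranking = sorted(range(5), key=f, reverse=True)   # global ranking of the nodes
```

Sorting the nodes by `f` gives the single global ranking that a ranking representable reciprocal relation induces.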

The notion of ranking representability is powerful for reciprocal relations,
because the majority of reciprocal relations satisfy this property, but for
symmetric relations it has a rather limited applicability. Ranking representability
as defined above cannot represent relations that originate from an underlying
metric or similarity measure. For such relations, one needs another connection
with its roots in Euclidean metric spaces [22].

Definition 4.5. A symmetric relation $Q : \mathcal{V}^2 \to [0, 1]$ is called Euclidean
representable if there exists a ranking function $f : \mathcal{V} \to \mathbb{R}^z$ such that for all pairs
$(v, v') \in \mathcal{V}^2$ it holds that

$$Q(v, v') = \nabla\left((f(v) - f(v'))^T (f(v) - f(v'))\right) , \tag{9}$$

with $a^T$ the transpose of a vector $a$.

Euclidean representability as defined here can basically be seen as Euclidean
embedding or Multidimensional Scaling in a z-dimensional space [69]. In its
most restrictive form, when z = 1, it implies that the symmetric relation can
be constructed from the Euclidean distance in a one-dimensional space. When
such a one-dimensional embedding can be realized, one global ranking of the
objects can be found, similar to reciprocal relations. Nevertheless, although
models of type (9) with z = 1 are sometimes used in graph inference [61] and
semi-supervised learning [2], we believe that situations where symmetric
relations become Euclidean representable in a one-dimensional space occur very
rarely, in contrast to reciprocal relations. Although the extension to z > 1
does not guarantee the existence of one global ranking, Euclidean
representability still enforces some interesting properties, because it
guarantees that the relation Q is constructed from a Euclidean metric space
with a dimension upper bounded by the number of nodes p. Moreover, this type of domain
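knowledge about relations can be incorporated in our framework. To this end,
let $\Phi(e) = \Phi(v, v') = (\phi(v) - \phi(v')) \otimes (\phi(v) - \phi(v'))$,
such that the pairwise kernel becomes

$$K^{\Phi}_{MLPK}(e, \bar{e}) = \left(K^{\Phi}_{R}(e, \bar{e})\right)^2
  = \left(K^{\phi}(v, \bar{v}) + K^{\phi}(v', \bar{v}') - K^{\phi}(v, \bar{v}') - K^{\phi}(v', \bar{v})\right)^2 .$$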

This kernel has been called the metric learning pairwise kernel by [60]. As a
consequence, the vector of parameters w can be rewritten as an r x r matrix W,
where $W_{ij}$ corresponds to the parameter associated with
$(\phi_i(v) - \phi_i(v'))(\phi_j(v) - \phi_j(v'))$, such that $W_{ij} = W_{ji}$.
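
The construction above can be illustrated with a minimal sketch, assuming a Gaussian RBF node kernel (the function names `rbf`, `ranking_kernel` and `mlpk` and the width `gamma` are hypothetical and only serve the illustration). It computes the metric learning pairwise kernel as the square of the ranking kernel and shows that swapping the two nodes of an edge leaves the kernel value unchanged, which is why models built on it are symmetric.

```python
import math


def rbf(x, y, gamma=0.5):
    """Gaussian RBF node kernel, an illustrative choice for K^phi."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def ranking_kernel(e, e_bar, k=rbf):
    """K_R(e, e_bar) = k(v, vb) + k(v', vb') - k(v, vb') - k(v', vb)."""
    (v, vp), (vb, vbp) = e, e_bar
    return k(v, vb) + k(vp, vbp) - k(v, vbp) - k(vp, vb)


def mlpk(e, e_bar, k=rbf):
    """Metric learning pairwise kernel: the squared ranking kernel."""
    return ranking_kernel(e, e_bar, k) ** 2


if __name__ == "__main__":
    e = ((0.0, 1.0), (1.0, 0.5))        # edge (v, v') with 2-d node features
    e_bar = ((0.2, 0.3), (0.9, 0.1))    # a second edge
    swapped = (e[1], e[0])              # the same edge with nodes swapped
    print(mlpk(e, e_bar), mlpk(swapped, e_bar))
```

Swapping the nodes only flips the sign of the ranking kernel, so squaring removes the dependence on the order of the pair.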
Proposition 4.6. If W is positive semi-definite, then the symmetric relation
$Q : \mathcal{V}^2 \to [0, 1]$ given by $Q(v, v') = \nabla(h(v, v'))$, with $h$
the model obtained for $K^{\Phi} = K^{\Phi}_{MLPK}$, is a
Euclidean representable symmetric relation.

See the appendix for the proof. Although the model established by
$K^{\Phi}_{MLPK}$ does not result in a global ranking, this model strongly
differs from the one established with $K^{\Phi}_{\otimes S}$, since
$K^{\Phi}_{MLPK}$ can only represent symmetric relations that exhibit
transitivity properties. Therefore, one should definitely use $K^{\Phi}_{MLPK}$
when, for example, the underlying relation corresponds to a metric or a
similarity relation, while the kernel $K^{\Phi}_{\otimes S}$ should preferably
be used for symmetric relations for which no further domain knowledge can be
assumed beforehand.



5 Relationships with other machine learning algorithms

As explained in Section 2, the transition from a standard classification or regres- 
sion setting to the setting of learning graded relations should be rather found 
in the specification of joint feature mappings over couples of objects, thereby 
naturally leading to the introduction of specific kernels. Any existing machine 
learning algorithm for classification or regression can in principle be adopted if 
joint feature mappings are constructed explicitly. Since kernel methods avoid
this explicit construction, they can often outperform non-kernelized algorithms
in terms of computational efficiency [47]. As a second main advantage, kernel
methods allow similarity scores to be expressed for structured objects, such as
strings, graphs, trees and text [48]. In our setting of learning graded
relations, this implies that one should plug these domain-specific kernel
functions into (5) or the other pairwise kernels that are discussed in this
paper. Such a scenario is in fact common practice in some applications of
Kronecker product pairwise kernels, such as predicting protein-ligand
compatibility in bioinformatics [28]. String kernels or graph kernels can be
defined on various types of biological structures [62], and Kronecker product
pairwise kernels then combine these object-based kernels into relation-based
kernels (thus, node kernels versus edge kernels).

The edge kernels we discussed in this article can be utilized within a wide 
variety of kernel methods. Since we focus on learning graded relations, one 
naturally arrives at a regression setting. In the following section, we run some 
experiments with regularized least-squares methods, which optimize Q using a 
hypothesis space induced by kernels. The solution is found by simply solving a
system of linear equations [41, 46, 48, 54].
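
As a rough illustration of how edge kernels plug into regularized least-squares, the sketch below builds Kronecker product pairwise kernels from a node kernel and fits the dual coefficients by solving the linear system $(K + \lambda I)\alpha = y$. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: the Gaussian node kernel, the scaling constant 2 in the symmetric and reciprocal variants, the plain Gaussian elimination solver and all function names are choices made here for illustration, and the model returns the real-valued score $h$ rather than the relation $Q$ itself.

```python
import math


def rbf(x, y, gamma=0.5):
    """Gaussian RBF node kernel (illustrative choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kron(e, eb, k=rbf):
    """Ordinary Kronecker product pairwise kernel."""
    (v, vp), (vb, vbp) = e, eb
    return k(v, vb) * k(vp, vbp)


def kron_sym(e, eb, k=rbf):
    """Symmetric variant; the factor 2 is an assumed scaling."""
    (v, vp), (vb, vbp) = e, eb
    return 2.0 * (k(v, vb) * k(vp, vbp) + k(v, vbp) * k(vp, vb))


def kron_rec(e, eb, k=rbf):
    """Reciprocal variant; the factor 2 is an assumed scaling."""
    (v, vp), (vb, vbp) = e, eb
    return 2.0 * (k(v, vb) * k(vp, vbp) - k(v, vbp) * k(vp, vb))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def rls_fit(edges, y, kernel, lam=1e-3):
    """Dual RLS coefficients: solve (K + lam * I) alpha = y."""
    n = len(edges)
    G = [[kernel(edges[i], edges[j]) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(G, y)


def rls_predict(e, edges, alpha, kernel):
    """Score h(e) = sum_i alpha_i * K(e, e_i)."""
    return sum(a * kernel(e, ei) for a, ei in zip(alpha, edges))


if __name__ == "__main__":
    edges = [((0.0, 1.0), (1.0, 0.0)), ((1.0, 1.0), (0.0, 0.0)),
             ((0.5, 0.2), (0.1, 0.9))]
    alpha = rls_fit(edges, [0.2, 0.9, 0.4], kron_sym)
    e = ((0.3, 0.7), (0.8, 0.1))
    print(rls_predict(e, edges, alpha, kron_sym),
          rls_predict((e[1], e[0]), edges, alpha, kron_sym))
```

With the symmetric kernel the two printed scores coincide, and with the reciprocal kernel they would differ only in sign, whatever the training labels are. For realistic problem sizes one would replace the hand-written solver by an optimized linear algebra routine.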



Apart from kernel methods, we briefly mention a number of other algorithms 
that are somewhat connected, even though they provide solutions for different 
learning problems. If pairwise relations are considered between objects of two
different domains, one arrives at a learning setting that is referred to as
predicting labels for dyadic data [37]. Examples of such settings include link
prediction in bipartite graphs and movie recommendation for users. As such, one
could also argue that specific link prediction and matrix factorization methods
could be applied in our setting as well, see e.g. [35, 38, 52]. However, these
methods have been primarily designed for exploiting relationships in the output
space, whereas feature representations of the objects are often not observed or
simply irrelevant. Moreover, similar to the Cartesian pairwise kernel, these
methods cannot be applied in situations where predictions need to be made for
two new nodes that were not present in the training dataset.

Another connection can be observed with multivariate regression and structured
output prediction methods. Such methods have been occasionally applied in
settings where relations had to be learned [21]. Also recall that structured
output prediction methods use Kronecker product pairwise kernels on a regular
basis to define joint feature representations of inputs and outputs [58, 65].



In addition to predictive models for dyadic data, one can also detect connec- 
tions with certain information retrieval and pattern matching methods. How- 
ever, these methods predominantly use similarity as underlying relation, often 
in a purely intuitive manner, as a nearest neighbor type of learning, so they 
can be considered as much more restrictive. Consider the example of protein
ranking [64] or algorithms like query by document [68]. These methods simply
look for rankings where the most similar objects w.r.t. the query object appear
on top, contrary to our approach, which should be considered as much more 
general, since we learn rankings from any type of binary relation. Nonetheless, 
similarity relations will of course still occupy a prominent place in our framework 
as an important special case. 



6 Experiments 

In the experiments, we test the ability of the pairwise kernels to model different 
types of relations, and the effect of enforcing prior knowledge about the
properties of the learned relations. To this end, we train the regularized
least-squares (RLS) algorithm to regress the relation values [41]. We perform
experiments on
both symmetric and reciprocal relations, considering both synthetic and real- 
world data. In addition to the standard, symmetric and reciprocal Kronecker 
product pairwise kernels, we also consider the Cartesian kernel, the symmetric 
Cartesian kernel and the metric learning pairwise kernel. 



6.1 Synthetic data: learning similarity measures 

Experiments on synthetic data were conducted to illustrate the behavior of the 
different kernels in terms of the transitivity of the relation to be learned. A 
parametric family of cardinality-based similarity measures for sets was
considered as the relation of interest [11]. For two sets A and B, let us
define the following cardinalities:

$$\Delta_{A,B} = |A \setminus B| + |B \setminus A|,$$
$$\delta_{A,B} = |A \cap B|,$$
$$\nu_{A,B} = |(A \cup B)^{c}|,$$

then this family of similarity measures for sets can be expressed as:

$$S(A, B) = \frac{t \Delta_{A,B} + u \delta_{A,B} + v \nu_{A,B}}{t' \Delta_{A,B} + u \delta_{A,B} + v \nu_{A,B}}, \qquad (10)$$

with t, t', u and v four parameters. This family of similarity measures includes
many well-known similarity measures for sets, such as the Jaccard coefficient,
the simple matching coefficient [50] and the Dice coefficient [16].

Table 1: Methods considered in the experiments

Abbreviation              Method
MPRED                     Predicting the mean
$K^{\Phi}_{\otimes}$      Kronecker Product Pairwise Kernel
$K^{\Phi}_{\otimes S}$    Symmetric Kronecker Product Pairwise Kernel
$K^{\Phi}_{\otimes R}$    Reciprocal Kronecker Product Pairwise Kernel
$K^{\Phi}_{MLPK}$         Metric Learning Pairwise Kernel
$K^{\Phi}_{C}$            Cartesian Product Pairwise Kernel
$K^{\Phi}_{CS}$           Symmetric Cartesian Pairwise Kernel

Three members of this family are investigated in our experiments. The
first one is the Jaccard coefficient, corresponding to (t, t', u, v) = (0, 1, 1, 0).
The Jaccard coefficient is known to be $T_L$-transitive. The second member that
we investigate was originally proposed by [51]. It corresponds to
(t, t', u, v) = (0, 1, 2, 2) and it does not satisfy $T_L$-transitivity, which is
considered a very weak transitivity condition. Conversely, the third member
that we analyse has rather strong transitivity properties. It is given by
(t, t', u, v) = (1, 2, 1, 1) and it satisfies $T_P$-transitivity.

Features and labels for all three members are generated as follows. First we
generate 20-dimensional feature vectors consisting of statistically independent
features that follow a Bernoulli distribution with $\pi = 0.5$. Subsequently,
the above-mentioned similarity measures are computed for each pair of feature
vectors, resulting in a deterministic mapping between features and labels.
Finally, to introduce some noise in the problem setting, 10% of the features
are swapped in a last step from a zero to a one or vice versa. Figure 2
illustrates the distribution of the obtained similarity scores for a 100 x 100
matrix.
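
The data generation procedure can be sketched as follows. This is a minimal illustration under a few assumptions that the text does not fix: each binary feature vector is read as the indicator vector of a set, noise is added by flipping every feature independently with probability 0.1 (rather than flipping exactly 10% of the entries), and a zero denominator in the similarity measure is resolved by returning 1. The function names and the seed are hypothetical.

```python
import random


def similarity(a, b, t, tp, u, v):
    """Cardinality-based similarity (10) for two binary indicator vectors."""
    diff = sum(1 for x, y in zip(a, b) if x != y)                # |A\B| + |B\A|
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)     # |A n B|
    neither = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)  # |(A u B)^c|
    num = t * diff + u * both + v * neither
    den = tp * diff + u * both + v * neither
    return num / den if den else 1.0  # assumed convention for 0/0


def make_dataset(n_nodes, params, n_features=20, noise=0.1, seed=0):
    """Return noisy features and the noise-free similarity matrix."""
    rng = random.Random(seed)
    clean = [[1 if rng.random() < 0.5 else 0 for _ in range(n_features)]
             for _ in range(n_nodes)]
    labels = [[similarity(a, b, *params) for b in clean] for a in clean]
    # Flip each feature independently with probability `noise` (assumption).
    noisy = [[1 - x if rng.random() < noise else x for x in row]
             for row in clean]
    return noisy, labels


if __name__ == "__main__":
    jaccard = (0, 1, 1, 0)
    features, labels = make_dataset(100, jaccard)
    print(len(features), len(features[0]), labels[0][1])
```

The labels are computed before the features are perturbed, so the noise affects only the inputs seen by the learner and not the target relation.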

Figure 2: The distribution of similarity scores obtained on a 100 by 100 matrix

Table 2: The predictive performance on test data for the different types of
relations and kernels. In this experiment, the task is to predict relation values
for unknown edges in a partially observed relational graph. The performance
measure is the mean squared error.

Setting          (t,t',u,v)  MPRED    $K^{\Phi}_{\otimes}$  $K^{\Phi}_{\otimes S}$  $K^{\Phi}_{MLPK}$  $K^{\Phi}_{C}$  $K^{\Phi}_{CS}$
Intransitive     (0,1,2,2)   0.01038  0.00908               0.00773                 0.00768            0.00989         0.00924
$T_L$-transitive (0,1,1,0)   0.01514  0.00962               0.00781                 0.00805            0.01155         0.00941
$T_P$-transitive (1,2,1,1)   0.00259  0.00227               0.00192                 0.00188            0.00248         0.00231

Table 3: The predictive performance on test data for the different types of
relations and kernels. In this experiment, the task is to predict relation values
for a completely new set of nodes. The performance measure is the mean squared
error.

Setting          (t,t',u,v)  MPRED    $K^{\Phi}_{\otimes}$  $K^{\Phi}_{\otimes S}$  $K^{\Phi}_{MLPK}$
Intransitive     (0,1,2,2)   0.01032  0.00995               0.00936                 0.00971
$T_L$-transitive (0,1,1,0)   0.01515  0.01236               0.01166                 0.01453
$T_P$-transitive (1,2,1,1)   0.00259  0.00251               0.00236                 0.00242

In the experiments, we always generate three data sets: a training set for
building the model, a validation set for hyperparameter selection, and a test set
for performance evaluation. We perform two kinds of experiments. In the first
experiment, we have a single set of 100 nodes, and 500 node pairs are randomly
sampled without replacement for the training, validation and test sets. Thus,
the learning problem here is, given a subset of the relation values for a fixed
set of nodes, to learn to predict missing relation values. This setup allows us to 
test also the Cartesian kernel, which is unable to generalize to completely new 
pairs of nodes. In the second experiment, we generate three separate sets of 100 
nodes for the training, validation and test sets, and sample from each of these 
500 edges. This experiment allows us to test the generalization capability of 
the learned models with respect to new couples of nodes (i.e., previously unseen 
nodes). Here, the Cartesian kernel is not applicable, and thus not included in 
the experiment. The experiments are repeated 100 times, and the presented
results are means over the repetitions. For statistical significance testing,
we use the paired Wilcoxon signed-rank test with significance level 0.05. All
pairs of kernels are compared, and the conservative Bonferroni correction is
applied to take into account multiple hypothesis testing, meaning that the
required p-value is divided by the number of comparisons. The Gaussian RBF
kernel was considered at the node level. The performance measure used is the
mean squared error (MSE). For training RLS we solve the corresponding system of
linear equations using matrix factorization, with an explicit regularization
parameter. A grid search is conducted to select the width of the Gaussian RBF
kernel and the regularization parameter of the RLS algorithm. Both parameters
are selected
from the range 2~^'^, . . . , 2^. 
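
The model selection and testing protocol described above can be sketched as follows. The helper names, the example exponents of the grid and the assumption that only the kernels (and not the mean predictor) enter the pairwise comparisons are all hypothetical; `validation_mse` stands for any callable that trains on the training set and returns the MSE on the validation set.

```python
from itertools import product
from math import comb


def power_of_two_grid(lo, hi):
    """Grid 2^lo, 2^(lo+1), ..., 2^hi (the exponents are placeholders)."""
    return [2.0 ** e for e in range(lo, hi + 1)]


def grid_search(validation_mse, widths, lambdas):
    """Return the (width, lambda) pair with the lowest validation MSE."""
    return min(product(widths, lambdas), key=lambda p: validation_mse(*p))


def bonferroni_level(alpha, n_methods):
    """Per-comparison significance level when all pairs are compared."""
    return alpha / comb(n_methods, 2)


if __name__ == "__main__":
    # Toy validation error with its minimum at width 1 and lambda 0.5.
    toy = lambda width, lam: (width - 1.0) ** 2 + (lam - 0.5) ** 2
    grid = power_of_two_grid(-2, 2)
    print(grid_search(toy, grid, grid))
    # Five kernels give 10 pairwise comparisons.
    print(bonferroni_level(0.05, 5))
```

With five kernels, as in the first experiment, comparing all pairs gives ten tests, so each individual test would have to reach a p-value of 0.005 under this reading of the correction.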

The results for the experiments are presented in Tables 2 and 3. In both cases
all the kernels outperform the mean as prediction, meaning that they are able to
model the underlying relations. For all the learning methods, the error is lower
in the first experiment than in the second one, demonstrating that it is easier
to predict relations between known nodes than to generalize to a new set of
nodes. Enforcing symmetry is clearly beneficial, as the symmetric Kronecker
product pairwise kernel always outperforms the standard Kronecker product
pairwise kernel, and the symmetric Cartesian kernel always outperforms the
standard one. Comparing the Kronecker and Cartesian kernels, the Kronecker one
leads to clearly lower error rates. With the exception of the $T_L$-transitive
case in the second experiment, MLPK turns out to be highly successful in
modeling the relations, probably due to enforcing symmetry of the learned
relation. In the first experiment, all the differences are statistically
significant, apart from the difference between the symmetric Kronecker product
pairwise kernel and MLPK for the intransitive case. In the second experiment,
all the differences are statistically significant. We can conclude that
including prior knowledge about symmetry helps to boost the predictive
performance in this problem.



6.2 Learning the similarity between documents 

In the second experiment, we compare the ordinary and symmetric Kronecker
pairwise kernels on a real-world data set based on newsgroups documents
(available at http://people.csail.mit.edu/jrennie/20Newsgroups/). The data is
sampled from 4 newsgroups: rec.autos, rec.sport.baseball,
comp.sys.ibm.pc.hardware and comp.windows.x. The aim is to learn to predict the
similarity of two documents as measured by the number of common words they
share. The node features correspond to the number of occurrences of a word in a
document. Unlike the previous experiment, the feature representation is very
high-dimensional and sparse, as there are more than 50000 possible features,
the majority of which are zero for any given document. First, we sample
separate training, validation and test sets, each consisting of 1000 nodes.
Second, we sample edges connecting the nodes in the training and validation set
using exponentially growing sample sizes to measure the effect of sample size
on the differences between the kernels. The sample size grid is
[100, 200, 400, ..., 102400]. Again, we sample only edges with different
starting and end nodes. When computing the test performance, we consider all
the edges in the test set, except those starting and ending at the same node.
The linear kernel is used at the node level. We train the RLS algorithm using
conjugate gradient optimization with early stopping [42]; optimization is
terminated once the MSE on the validation set has failed to decrease for 10
consecutive iterations. Since we rely on the regularizing effect of early
stopping, a separate regularization parameter is not needed in this experiment.
We do not include other types of kernels than the Kronecker product pairwise
kernels in the experiment. To the best of our knowledge, no algorithms that
scale to the considered experiment size exist for the other kernel functions.
Hence, this experiment mainly aims to illustrate the computational advantages
of the Kronecker product pairwise kernel. The mean as prediction achieves an
MSE around 145 on this dataset.

The results are presented in Figure 3. Even for 100 pairs, the errors for both
kernels are much lower than the results for the mean as prediction, showing
that the RLS algorithm succeeds with both kernels in learning the underlying
relation. Increasing the training set size leads to a decrease in test error.
Using the prior knowledge about the symmetry of the learned relation is clearly
helpful. The symmetric kernel achieves a lower error than the ordinary
Kronecker product pairwise kernel for all sample sizes, and the largest
differences are observed for the smallest sample sizes. For 100 training
instances, the error is almost halved by enforcing symmetry.

Figure 3: The comparison of the ordinary Kronecker product pairwise kernel
$K^{\Phi}_{\otimes}$ and the symmetric Kronecker product pairwise kernel
$K^{\Phi}_{\otimes S}$ on the Newsgroups dataset. The mean squared error is
shown as a function of the training set size.

6.3 Competition between species 

In this final experiment we evaluate the performance of the ordinary and
reciprocal Kronecker pairwise kernels and the metric learning pairwise kernel on
simulated data from an ecological model. The setup is based on the one
described in [1]. This model provides an elegant explanation for the coexistence
of multiple species in the same habitat, a problem that has puzzled ecologists
for decades [26].

Imagine n species sharing a habitat and struggling for their share of the
resources. One species can dominate another species based on k so-called
limiting factors. A limiting factor defines an attribute that can give a fitness
advantage, for example in plants, the ability to photosynthesize, the ability to
draw minerals from the soil, resistance to diseases, etc. Each species can score
better or worse on each of its k limiting factors. The degree to which one
species can dominate a competitor is relative to the number of limiting factors
for which it is superior. All possible interactions can thus be represented in a
tournament. In this framework relations are reciprocal and often intransitive.

For this simulation 400 species were simulated with 10 limiting factors. The
value of each limiting factor is for each species drawn from a uniform
distribution between 0 and 1. Thus, any species v can be represented by a
vector f of length k with the limiting factors as elements. The probability that
a species v dominates species v' can easily be calculated:

$$Q(v, v') = \frac{1}{k} \sum_{i=1}^{k} H(f_i - f'_i), \qquad (11)$$

where H(x) is the Heaviside step function.
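
A minimal sketch of this simulation is given below. The function names and the seed are hypothetical, and the convention H(0) = 0 is an assumption; it is immaterial for continuous draws, where ties have probability zero. The example at the end shows three hand-picked species that dominate each other cyclically, which illustrates why such relations are reciprocal yet intransitive.

```python
import random


def heaviside(x):
    """Heaviside step function with the assumed convention H(0) = 0."""
    return 1.0 if x > 0 else 0.0


def dominance(f, fp):
    """Q(v, v'): fraction of limiting factors on which v beats v'."""
    return sum(heaviside(a - b) for a, b in zip(f, fp)) / len(f)


def simulate_species(n_species, k, seed=0):
    """Draw k limiting factors per species uniformly from [0, 1)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(k)] for _ in range(n_species)]


def tournament(species):
    """Complete tournament matrix of pairwise dominance probabilities."""
    return [[dominance(f, fp) for fp in species] for f in species]


if __name__ == "__main__":
    T = tournament(simulate_species(400, 10))
    print(T[0][1], T[1][0])  # the two values sum to one (reciprocity)

    # A cyclic triple: a dominates b, b dominates c, c dominates a.
    a, b, c = [0.3, 0.1, 0.2], [0.2, 0.3, 0.1], [0.1, 0.2, 0.3]
    print(dominance(a, b), dominance(b, c), dominance(c, a))
```

In the cyclic triple each species wins on two of the three limiting factors against the next one, so every pairwise dominance probability in the cycle exceeds one half.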

Of the 400 species, 200, 100 and 100 were used for generating training, vali- 
dation and testing data. For each subset, the complete tournament matrix was 
determined using (11). From those matrices 1200 interactions were sampled 
for training, 600 for model validation and 600 for testing. No combination of 
species was used more than once. Using the limiting factors as features, we try 
to regress the probability that one species dominates another one using the ordi- 
nary and reciprocal Kronecker product pairwise kernels and the metric learning 
pairwise kernel. Again, the Gaussian kernel is applied as the node kernel. The 
validation set is used to determine the optimal regularization parameter and 
kernel width parameter from the grids 2"'^", 2^^^ . . ., 2^ and 2"-'^°, 2~^ . . ., 2^. 
To obtain statistically significant results the setup is repeated 100 times. 



Table 4: The predictive performance on test data for the different types of
kernels. The performance measure is the mean squared error.

Kernel  MPRED    $K^{\Phi}_{\otimes}$  $K^{\Phi}_{\otimes R}$  $K^{\Phi}_{MLPK}$
MSE     0.02795  0.01082               0.01067                 0.02877



The results are shown in Table 4. The Wilcoxon signed-rank test with
significance level 0.05 is used for significance testing, and a conservative
Bonferroni correction is applied for multiple hypothesis testing. All
differences are statistically significant.

The metric learning pairwise kernel gives rise to worse predictions than the
mean as prediction. This is not surprising, as the MLPK cannot learn reciprocal
relations. The ordinary Kronecker product pairwise kernel performs well and
the reciprocal Kronecker product pairwise kernel performs even better. The
results show that using the information on the types of relations to be learned
can boost the accuracy of the predictions.
7 Conclusion



A general kernel-based framework for learning various types of graded relations 
was presented in this article. This framework extends existing approaches for 
learning relations, because it can handle crisp and graded relations. A Kronecker 
product feature mapping was proposed for combining the features of pairs of 
objects that constitute a relation (edge level in a graph), and it was shown 
that this mapping leads to a class of universal approximators, if an appropriate 
kernel is chosen on the object level (node level in a graph). 

In addition, we clarified that domain knowledge about the relation to be 
learned can be easily incorporated in our framework, such as reciprocity and 
symmetry properties. Experimental results on synthetic and real-world data 
clearly demonstrate that this domain knowledge really helps in improving the 
generalization performance. Moreover, important links with recent develop- 
ments in fuzzy set theory and decision theory can be established, by looking at 
transitivity properties of relations. 
Appendix

7.1 Formal definitions 

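Definition 7.1. The Kronecker product of two matrices M and N is defined
as

$$M \otimes N = \begin{pmatrix} M_{1,1} N & \cdots & M_{1,b} N \\ \vdots & \ddots & \vdots \\ M_{a,1} N & \cdots & M_{a,b} N \end{pmatrix},$$

where $M_{i,j}$ denotes the entries of M, a matrix with a rows and b columns.

Definition 7.2 ([53]). A continuous kernel K on a compact metric space
$\mathcal{V}$ (i.e. $\mathcal{V}$ is closed and bounded) is called universal if
the RKHS induced by K is dense in $C(\mathcal{V})$, where $C(\mathcal{V})$ is
the space of all continuous functions $f : \mathcal{V} \to \mathbb{R}$.
That is, for every function $f \in C(\mathcal{V})$ and every $\epsilon > 0$,
there exists a set of input points $\{v_i\}_{i=1}^{m} \in \mathcal{V}$ and real
numbers $\{\alpha_i\}_{i=1}^{m}$, with $m \in \mathbb{N}$, such that

$$\max_{v \in \mathcal{V}} \left| f(v) - \sum_{i=1}^{m} \alpha_i K(v_i, v) \right| \leq \epsilon .$$

Accordingly, the hypothesis space induced by the kernel K can approximate any
function in $C(\mathcal{V})$ arbitrarily well, and hence it has the universal
approximating property.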
The following result is known in the literature as the Stone-Weierstraß
theorem (see e.g. [45]):

Theorem 7.3 (Stone-Weierstraß). Let $\mathcal{V}$ be a compact metric space and
let $C(\mathcal{V})$ be the set of real-valued continuous functions on
$\mathcal{V}$. If $A \subset C(\mathcal{V})$ is a subalgebra of
$C(\mathcal{V})$, that is,

$$\forall f, g \in A, r \in \mathbb{R} : f + r g \in A, \; f g \in A ,$$

and A separates points in $\mathcal{V}$, that is,

$$\forall v, v' \in \mathcal{V}, v \neq v' : \exists g \in A : g(v) \neq g(v') ,$$

and A does not vanish at any point in $\mathcal{V}$, that is,

$$\forall v \in \mathcal{V} : \exists g \in A : g(v) \neq 0 ,$$

then A is dense in $C(\mathcal{V})$.

7.2 Proofs 



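Proof. (Theorem 2.1) Let us define

$$A \otimes A = \{ t \mid t(v, v') = g(v) u(v'), \; g, u \in A \} \qquad (12)$$

for a compact metric space $\mathcal{V}$ and a set of functions
$A \subset C(\mathcal{V})$. We observe that the RKHS of the kernel
$K^{\Phi}_{\otimes}$ can be written as $\mathcal{H} \otimes \mathcal{H}$, where
$\mathcal{H}$ is the RKHS of the kernel $K^{\phi}$.

Let $\epsilon > 0$ and let $t \in C(\mathcal{V}) \otimes C(\mathcal{V})$ be an
arbitrary function which can, according to (12), be written as
$t(v, v') = g(v) u(v')$, where $g, u \in C(\mathcal{V})$. By definition of the
universality property, $\mathcal{H}$ is dense in $C(\mathcal{V})$. Therefore,
$\mathcal{H}$ contains functions $\bar{g}, \bar{u}$ such that

$$\max_{v \in \mathcal{V}} |\bar{g}(v) - g(v)| \leq \tilde{\epsilon}, \qquad \max_{v \in \mathcal{V}} |\bar{u}(v) - u(v)| \leq \tilde{\epsilon},$$

where $\tilde{\epsilon}$ is a constant for which it holds that

$$\max_{v, v' \in \mathcal{V}} \left\{ |\tilde{\epsilon} g(v)| + |\tilde{\epsilon} u(v')| + \tilde{\epsilon}^2 \right\} \leq \epsilon .$$

Note that, according to the extreme value theorem, the maximum exists due to
the compactness of $\mathcal{V}$ and the continuity of the functions g and u.
Now we have

$$\max_{v, v' \in \mathcal{V}} |t(v, v') - \bar{g}(v) \bar{u}(v')|
  \leq \max_{v, v' \in \mathcal{V}} \left\{ |t(v, v') - g(v) u(v')| + |\tilde{\epsilon} g(v)| + |\tilde{\epsilon} u(v')| + \tilde{\epsilon}^2 \right\}
  = \max_{v, v' \in \mathcal{V}} \left\{ |\tilde{\epsilon} g(v)| + |\tilde{\epsilon} u(v')| + \tilde{\epsilon}^2 \right\}
  \leq \epsilon ,$$

which confirms the density of $\mathcal{H} \otimes \mathcal{H}$ in
$C(\mathcal{V}) \otimes C(\mathcal{V})$.

According to Tychonoff's theorem, $\mathcal{V}^2$ is compact if $\mathcal{V}$ is
compact. It is straightforward to see that
$C(\mathcal{V}) \otimes C(\mathcal{V})$ is a subalgebra of $C(\mathcal{V}^2)$,
it separates points in $\mathcal{V}^2$, it vanishes at no point of
$\mathcal{V}^2$, and it is therefore dense in $C(\mathcal{V}^2)$ due to
Theorem 7.3. Consequently, $\mathcal{H} \otimes \mathcal{H}$ is also dense in
$C(\mathcal{V}^2)$, and $K^{\Phi}_{\otimes}$ is a universal kernel on
$\mathcal{E}$. □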



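Proof. (Theorem 3.4) Let $\epsilon > 0$ and $t \in R(\mathcal{V}^2)$ be an
arbitrary function. According to Theorem 2.1, the RKHS of the kernel
$K^{\Phi}_{\otimes}$ defined in (5) is dense in $C(\mathcal{V}^2)$. Therefore,
we can select a set of edges $\{e_i\}_{i=1}^{m}$ and real numbers
$\{\alpha_i\}_{i=1}^{m}$ such that the function

$$u(v, v') = \sum_{i=1}^{m} \alpha_i K^{\Phi}_{\otimes}((v, v'), e_i) ,$$

belonging to the RKHS of the kernel (5), fulfills

$$\max_{(v, v') \in \mathcal{V}^2} |t(v, v') - 4 u(v, v')| \leq \frac{1}{2} \epsilon . \qquad (13)$$

We observe that, because $t(v, v') = -t(v', v)$, the function u also fulfills

$$\max_{(v, v') \in \mathcal{V}^2} |t(v, v') + 4 u(v', v)| \leq \frac{1}{2} \epsilon ,$$

and hence

$$\max_{(v, v') \in \mathcal{V}^2} |4 u(v, v') + 4 u(v', v)| \leq \epsilon . \qquad (14)$$

Let

$$\gamma(v, v') = 2 u(v, v') + 2 u(v', v) .$$

Due to (14), we have

$$\max_{(v, v') \in \mathcal{V}^2} |\gamma(v, v')| \leq \frac{1}{2} \epsilon . \qquad (15)$$

Now, let us consider the function

$$h(v, v') = \sum_{i=1}^{m} \alpha_i K^{\Phi}_{\otimes R}((v, v'), e_i) ,$$

which is obtained from u by replacing kernel (5) with kernel (6). We observe
that

$$h(v, v') = 2 u(v, v') - 2 u(v', v) = 4 u(v, v') - \gamma(v, v') . \qquad (16)$$

By combining (13), (15) and (16), we observe that the function h fulfills
$\max_{(v, v') \in \mathcal{V}^2} |t(v, v') - h(v, v')| \leq \epsilon$, which is
the required approximation property. □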
Proof. (Proposition 4.6) The model that we consider can be written as:

$$Q(v, v') = \nabla\left((\phi(v) - \phi(v'))^{\top} W (\phi(v) - \phi(v'))\right) .$$

The connection with (9) then immediately follows by decomposing W as
$W = U^{\top} U$ with U an arbitrary matrix. The specific case of z = 1 is
obtained when U can be written as a single-row matrix. □
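
The decomposition step of this proof can be spelled out as a short worked derivation. The sketch assumes that U has z rows and r columns, so that the function $f(v) = U \phi(v)$ introduced here (a name chosen for this illustration, matching the ranking function of Definition 4.5) maps into $\mathbb{R}^z$; such a factorization exists for every positive semi-definite W, for example through its eigendecomposition.

```latex
\begin{align*}
(\phi(v) - \phi(v'))^{\top} W (\phi(v) - \phi(v'))
  &= (\phi(v) - \phi(v'))^{\top} U^{\top} U (\phi(v) - \phi(v')) \\
  &= \bigl(U\phi(v) - U\phi(v')\bigr)^{\top} \bigl(U\phi(v) - U\phi(v')\bigr) \\
  &= (f(v) - f(v'))^{\top} (f(v) - f(v')),
  \qquad f(v) = U\phi(v) \in \mathbb{R}^{z}.
\end{align*}
```

Applying $\nabla$ to both sides gives exactly the form required in (9), and a single-row U yields a scalar-valued f, which is the case z = 1.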